Abstract

We propose the following simple stochastic model for phylogenetic trees. New types are 
born and die according to a birth and death chain. At each birth we associate a fitness to the 
new type sampled from a fixed distribution. At each death the type with the smallest fitness 
is killed. We show that if the birth (i.e. mutation) rate is subcritical we get a phylogenetic 
tree consistent with an influenza tree (few types at any given time and one dominating type 
lasting a long time). When the birth rate is supercritical we get a phylogenetic tree consistent
with an HIV tree (many types at any given time, none lasting very long). 

1 Introduction 

The influenza phylogenetic tree is peculiar in that it is very skinny: one type dominates for a 
long time and any other type that arises quickly dies out. Then the dominating type suddenly 
dies out and is immediately replaced by a new dominating type. The models proposed so far 
are very complex and make many assumptions. See for instance Koelle et al. (2006) and van
Nimwegen (2006). We would like to use a simple stochastic model for such a tree. The other
motivation for this work comes from the comparison between influenza and HIV phylogenetic 
trees. An HIV tree is characterized by a radial spread outward from an ancestral node, in sharp 
contrast with an influenza tree. Moreover, Korber et al. (2001) note that the influenza virus 
is less diverse worldwide than the HIV virus is in Amsterdam alone. However, both types of 
trees are supposed to be produced by the same basic mechanism: mutations. Can the same 
mathematical model produce two trees that are so different? Our simple stochastic model will 
show a striking difference in behavior depending on the mutation rate. 

Our model has a birth and death component and a fitness component. For the birth and
death component we do the following. If there are $n \ge 1$ types at a certain time $t$, then a
new type is born (by mutation) at rate $n\lambda$. We think of a birth as the appearance of one new
type, not the replacement of one type by two new types. If there are $n \ge 2$ types, then one
type dies at rate $n$. If only one type is left it cannot die. That is,

$$n \to n + 1 \quad \text{at rate } n\lambda,$$

$$n \to n - 1 \quad \text{at rate } n \text{ if } n \ge 2.$$

Moreover, each new type is assigned a fitness value chosen from a fixed distribution,
independently each time. At every death event, the type that is killed is the one
with the smallest fitness. Since all that matters is the ranks of the fitnesses, we might as well
take their distribution to be uniform on $[0, 1]$. For simplicity the process is started with a single
type.
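
The dynamics just described can be sketched as a short simulation. This is only an illustration
under stated assumptions, not code from the paper: the function name `simulate_types`, its
parameters, and the use of a min-heap to locate the least fit type are choices made for this
example, and it assumes a birth rate parameter `lam` that is strictly positive.

```python
import heapq
import random


def simulate_types(lam, t_max, seed=0):
    """Simulate the type process up to time t_max (assumes lam > 0).

    Returns a list of (time, number_of_types, max_fitness), recorded at
    time 0 and after every birth or death event.
    """
    rng = random.Random(seed)
    first = rng.random()          # fitness of the single initial type
    alive = [first]               # min-heap of fitnesses of living types
    best = first                  # largest fitness among living types
    t = 0.0
    history = [(t, 1, best)]
    while True:
        n = len(alive)
        birth_rate = n * lam
        death_rate = n if n >= 2 else 0   # a lone type cannot die
        total = birth_rate + death_rate
        t += rng.expovariate(total)       # waiting time to the next event
        if t > t_max:
            return history
        if rng.random() * total < birth_rate:
            f = rng.random()              # new type gets a uniform fitness
            heapq.heappush(alive, f)
            best = max(best, f)
        else:
            heapq.heappop(alive)          # kill the least fit type
        history.append((t, len(alive), best))
```

For instance, `simulate_types(0.5, 100.0)` returns the trajectory of a subcritical run. Because
deaths always remove the least fit type, the recorded maximal fitness never decreases.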

We give no specific rule on how to attach a new type after a birth to existing types (in order 
to construct a tree). Our results do not depend on such a rule. Two natural possibilities are 
to either attach the new type to the type which has the maximum fitness or to a type taken at 
random. 

Theorem 1. Take $\alpha \in (0, 1)$. If $\lambda \le 1$, then

$$\lim_{t \to \infty} P(\text{maximal types at times } \alpha t \text{ and } t \text{ are the same}) = \alpha,$$

while if $\lambda > 1$, then this limit is 0.

We see that if $\lambda \le 1$, the dominating type (i.e. the fittest type) at time $t$ has likely been
present for a time of order $t$ and at any given time there will not be many types. This is consistent
with the observed structure of an influenza tree. On the other hand, if $\lambda > 1$, then the dominating
type at time $t$ has likely been present for a time of order shorter than $t$ and at any given time
there will be many types. This is consistent with an HIV tree.
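
The dichotomy can be explored numerically with a rough Monte Carlo sketch. The helper names
`same_maximal_type` and `estimate` are hypothetical, and `lam` is assumed strictly positive. The
sketch relies on the observation that the fittest type is never the one killed, so it is enough
to track the number of types and the label of the current record fitness.

```python
import random


def same_maximal_type(lam, s, t, rng):
    """Run the chain to time t (with 0 < s < t) and report whether the
    fittest type at time s is still the fittest type at time t."""
    n = 1                      # current number of types
    best = rng.random()        # fitness of the current fittest type
    best_id = 0                # label of the current fittest type
    best_id_at_s = None        # label of the fittest type at time s
    next_id = 1
    clock = 0.0
    while True:
        birth_rate = n * lam
        death_rate = n if n >= 2 else 0
        total = birth_rate + death_rate
        clock += rng.expovariate(total)
        if best_id_at_s is None and clock > s:
            best_id_at_s = best_id        # state just before this event
        if clock > t:
            return best_id_at_s == best_id
        if rng.random() * total < birth_rate:
            f = rng.random()
            if f > best:
                best, best_id = f, next_id
            next_id += 1
            n += 1
        else:
            n -= 1             # the least fit type dies, never the fittest


def estimate(lam, alpha, t, runs, seed=0):
    """Fraction of runs in which times alpha*t and t share a maximal type."""
    rng = random.Random(seed)
    hits = sum(same_maximal_type(lam, alpha * t, t, rng) for _ in range(runs))
    return hits / runs
```

With these assumptions, `estimate(0.5, 0.5, 200.0, 1000)` should land near $\alpha = 0.5$, while
`estimate(2.0, 0.5, 6.0, 300)` should be small. Both finite $t$ and sampling noise contribute error.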

2 Proof of Theorem 1 

The proof divides into three cases, depending on whether the birth and death chain is positive 
recurrent, null recurrent, or transient. We present them in order of difficulty. 

2.1 Case $\lambda < 1$

Let $\tau_1, \tau_2, \dots$ be the (continuous) times between successive visits of the chain to 1,
$T_n = \tau_1 + \cdots + \tau_n$, $\sigma_1, \sigma_2, \dots$ be the number of new types introduced in
cycles between successive visits to 1, and $S_n = 1 + \sigma_1 + \cdots + \sigma_n$. Note that the
$\tau$'s and $\sigma$'s are not independent of each other, but the sequence
$(\tau_1, \sigma_1), (\tau_2, \sigma_2), \dots$ is i.i.d. and independent of the fitness sequence.
Define the usual renewal process $N(t)$ corresponding to the $\tau$'s by
$\{N(t) = n\} = \{T_n \le t < T_{n+1}\}$.

For $0 < s < t$, recalling that $T_{N(t)} \le t < T_{N(t)+1}$, and noting that the maximal type is
increasing in time, we see that

$$P(\text{maximal types at times } s \text{ and } t \text{ are the same},\ N(s) < N(t)) \tag{1}$$

lies between

$$P(\text{maximal types at times } T_{N(s)} \text{ and } T_{N(t)+1} \text{ are the same},\ N(s) < N(t))$$

and

$$P(\text{maximal types at times } T_{N(s)+1} \text{ and } T_{N(t)} \text{ are the same},\ N(s) < N(t)).$$

Let $\mathcal{F}$ be the $\sigma$-algebra generated by $(\tau_1, \sigma_1), (\tau_2, \sigma_2), \dots$.
Then for $k < l$, since the fitness sequence is i.i.d. and independent of $\mathcal{F}$,

$$P(\text{maximal types at times } T_k \text{ and } T_l \text{ are the same} \mid \mathcal{F}) = \frac{S_k}{S_l}.$$

More precisely, conditional on $\mathcal{F}$ there are $S_l$ fitnesses observed by time $T_l$ and $S_k$
of them are observed by time $T_k$. We claim that in $n$ i.i.d. observations the probability that the
largest occurs among the first $m$ is $m/n$, since any one of $n$ is equally likely to be the largest.
Since $N(s)$ and $N(t)$ are $\mathcal{F}$ measurable, it follows that (1) lies between

$$E\left[\frac{S_{N(s)}}{S_{N(t)+1}},\ N(s) < N(t)\right] \quad \text{and} \quad E\left[\frac{S_{N(s)+1}}{S_{N(t)}},\ N(s) < N(t)\right]. \tag{2}$$
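
The $m/n$ claim used above can be checked exactly for small $n$. The ranks of $n$ i.i.d.
continuous observations form a uniformly random permutation, so it suffices to count
permutations. The function name `prob_max_in_first` is an illustrative choice, not notation
from the paper.

```python
from fractions import Fraction
from itertools import permutations


def prob_max_in_first(m, n):
    """Exact probability that the largest of n distinct values, placed in
    uniformly random order, sits in one of the first m positions."""
    favourable = 0
    total = 0
    for order in permutations(range(n)):
        total += 1
        if order.index(n - 1) < m:   # position of the largest value
            favourable += 1
    return Fraction(favourable, total)
```

The enumeration is factorial in $n$, so it is only meant as a sanity check for small cases such
as `prob_max_in_first(2, 5)`.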



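Since $\lambda < 1$, $E\tau < \infty$, and the renewal theorem gives

$$N(s)/s \to 1/E\tau \quad \text{a.s.},$$

while the strong law of large numbers gives $S_{N(s)}/N(s) \to E\sigma$ a.s., so that
$S_{N(s)}/s \to E\sigma/E\tau$ a.s. Taking $s = \alpha t$, both ratios in (2) converge to $\alpha$ a.s.
and $P(N(\alpha t) < N(t)) \to 1$. It follows by the bounded convergence theorem that

$$\lim_{t \to \infty} P(\text{maximal types at times } \alpha t \text{ and } t \text{ are the same}) = \alpha. \tag{3}$$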

This completes the proof of Theorem 1 in the subcritical case. 

2.2 Case $\lambda > 1$

Define the $\tau$'s and $\sigma$'s as above, except that now, the cycles used are between the successive
times the chain reaches a new high. In other words, $T_n$ is the hitting time of $n + 1$, $\sigma_n$ is
the number of new types born during a first passage cycle from $n$ to $n + 1$ and $S_n$ is the number
of new types seen up to time $T_n$. Of course, the $\sigma$'s and $\tau$'s are no longer identically
distributed. However, $(\tau_1, \sigma_1), (\tau_2, \sigma_2), \dots$ are independent. The key to the
proof is the following Lemma.

Lemma 2. Assume that $\lambda > 1$. Then $e^{-(\lambda - 1)t} N(t)$ is almost surely bounded.

Proof of Lemma 2. Our first step in this proof is to estimate the first two moments of $\tau_n$.
Following Keilson (1979) (see (5.1.2)) we note that $\tau_n$ has the same distribution as

$$\frac{X}{(1 + \lambda)n} + Y(\tau_{n-1} + \tau'_n) \quad \text{for } n \ge 2, \tag{4}$$

where $X$ has a mean 1 exponential distribution, $\tau'_n$ has the same distribution as $\tau_n$, $Y$ is a
Bernoulli random variable with $P(Y = 1) = \frac{1}{1 + \lambda}$, and $X$, $Y$, $\tau_{n-1}$ and
$\tau'_n$ are independent. Letting $\mu_n = E\tau_n$, it follows from (4) that

$$\lambda \mu_n = \frac{1}{n} + \mu_{n-1} \quad \text{for } n \ge 2 \quad \text{and} \quad \mu_1 = \frac{1}{\lambda}. \tag{5}$$

We will use the following recursion formula, which is easy to prove by induction.

Lemma 3. Let $a_n$ and $b_n$ be two sequences of real numbers such that $a_1 = \lambda^{-1} b_1$ and for
$n \ge 2$, $\lambda a_n = b_n + a_{n-1}$. Then,

$$a_n = \sum_{j=1}^{n} \lambda^{-j} b_{n+1-j}.$$

Applying Lemma 3 to (5) we get

$$\mu_n = \sum_{j=1}^{n} \lambda^{-j} \frac{1}{n + 1 - j}.$$



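Writing $1/(\lambda - 1)$ as a geometric series we have

$$\mu_n - \frac{1}{n(\lambda - 1)} = \sum_{j=1}^{n} \lambda^{-j} \left(\frac{1}{n + 1 - j} - \frac{1}{n}\right) - \frac{1}{n} \sum_{j=n+1}^{\infty} \lambda^{-j}.$$

Changing the order of summation gives

$$\sum_{n=1}^{\infty} \sum_{j=1}^{n} \lambda^{-j} \left(\frac{1}{n + 1 - j} - \frac{1}{n}\right) = \sum_{j=1}^{\infty} \lambda^{-j} \sum_{n=j}^{\infty} \left(\frac{1}{n + 1 - j} - \frac{1}{n}\right).$$

Note that for $j \ge 2$

$$\sum_{n=j}^{\infty} \left(\frac{1}{n + 1 - j} - \frac{1}{n}\right) = \sum_{k=1}^{j-1} \frac{1}{k}$$

and this term is 0 for $j = 1$. Hence,

$$\sum_{n=1}^{\infty} \sum_{j=1}^{n} \lambda^{-j} \left(\frac{1}{n + 1 - j} - \frac{1}{n}\right) = \sum_{j=2}^{\infty} \lambda^{-j} \sum_{k=1}^{j-1} \frac{1}{k}$$

and

$$\sum_{n=1}^{\infty} \frac{1}{n} \sum_{j=n+1}^{\infty} \lambda^{-j} = \sum_{j=2}^{\infty} \lambda^{-j} \sum_{k=1}^{j-1} \frac{1}{k}.$$

We conclude that

$$\sum_{n=1}^{\infty} \left(\mu_n - \frac{1}{n(\lambda - 1)}\right) = 0.$$

Therefore,

$$E(T_n) - \frac{1}{\lambda - 1} \sum_{k=1}^{n} \frac{1}{k} \quad \text{converges to } 0. \tag{6}$$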
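
A short numerical check of this computation is sketched below: it iterates recursion (5),
compares the result with the closed form from Lemma 3 (taking $b_k = 1/k$), and evaluates the
centred mean appearing in (6). The function names are hypothetical, and the sketch assumes
`lam > 1`.

```python
def mean_passage_times(lam, n_max):
    """mu[n] = E(tau_n) from recursion (5); mu[0] is unused padding."""
    mu = [0.0, 1.0 / lam]
    for n in range(2, n_max + 1):
        mu.append((1.0 / n + mu[n - 1]) / lam)
    return mu


def closed_form(lam, n):
    """mu_n as given by Lemma 3 with b_k = 1/k."""
    return sum(lam ** (-j) / (n + 1 - j) for j in range(1, n + 1))


def centred_mean(lam, n):
    """E(T_n) - (1/(lam-1)) * sum_{k<=n} 1/k, the quantity in (6)."""
    mu = mean_passage_times(lam, n)
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    return sum(mu[1:]) - harmonic / (lam - 1.0)
```

Evaluating `centred_mean(2.0, n)` for growing `n` gives values that shrink toward 0, in line
with (6).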

We also need an almost sure result for $T_n$, and for this, we will estimate the second moment
of $\tau_n$. Let $v_n = \mathrm{Var}(\tau_n)$. It is easy to check that if $Y$ is a Bernoulli random
variable and is independent of a random variable $Z$ then

$$\mathrm{Var}(ZY) = E(Y)\,\mathrm{Var}(Z) + \mathrm{Var}(Y)\,(EZ)^2.$$

Using this remark and (4) we have for $n \ge 2$

$$v_n = \frac{1}{(1 + \lambda)^2 n^2} + \frac{1}{1 + \lambda}(v_n + v_{n-1}) + \frac{\lambda}{(1 + \lambda)^2}(\mu_n + \mu_{n-1})^2.$$

Therefore, for $n \ge 2$

$$\lambda v_n = b_n + v_{n-1} \tag{7}$$

where

$$b_n = \frac{1}{(1 + \lambda)n^2} + \frac{\lambda}{1 + \lambda}(\mu_n + \mu_{n-1})^2.$$

Set $\mu_0 = 0$; then $\lambda v_1 = b_1$. Hence, Lemma 3 applies to (7), giving

$$v_n = \sum_{j=1}^{n} \lambda^{-j} b_{n+1-j}. \tag{8}$$

Since $\mu_n \sim \frac{1}{n(\lambda - 1)}$ (that is, the ratio converges to 1), $b_n \sim C/n^2$ where
$C$ depends on $\lambda$ only. From (8) we get

$$v_n \sim \frac{C'}{n^2},$$

where $C'$ depends on $\lambda$ only. This implies the a.s. convergence of the random series
$\sum_n (\tau_n - E\tau_n)$ (see for instance Corollary 47.3 in Port (1994)). Therefore the partial
sums converge a.s. and

$$T_n - E(T_n) \quad \text{converges a.s.}$$

Using (6) we get that $T_n - \frac{1}{\lambda - 1} \log n$ converges a.s. and is therefore a.s. bounded.
Now use the fact that $\{N(t) \ge n\} = \{T_n \le t\}$ to conclude that $N(t) \exp(-(\lambda - 1)t)$ is
almost surely bounded. This concludes the proof of Lemma 2. □

We are now ready to complete the proof of Theorem 1 in the supercritical case. Let $(Z_i)_{i \ge 1}$
be a discrete time random walk starting at 0 that goes to the right with probability
$\lambda/(\lambda + 1)$ and to the left with probability $1/(\lambda + 1)$. For every $n \ge 1$, let
$Z_{i,n}$ be a discrete time random walk starting at 0 with the same rules of evolution as $Z_i$,
except that the random walk $Z_{i,n}$ has a reflecting barrier at $-n + 1$. For every $n \ge 1$, the two
random walks $Z_i$ and $Z_{i,n}$ are coupled so that they move together until (if ever) they hit
$-n + 1$, and thereafter we still couple them so that $Z_i \le Z_{i,n}$ for every $i \ge 0$. Let $U$ and
$U_n$ be the hitting times of 1 for the random walks $Z_i$ and $Z_{i,n}$, respectively.

First note that a new type appears every time there is a birth. Therefore, $\sigma_n$ is the number
of steps to the right of the random walk $Z_{i,n}$ stopped at 1. That is, $\sigma_n$ is $(1 + U_n)/2$.
We now show that $U_n$ converges a.s. to $U$. Let $\delta > 0$. We have

$$P(|U_n - U| > \delta) \le P(U > U_n) \le P(Z_i = -n + 1 \text{ for some } i \ge 1).$$

The last probability decays exponentially with $n$. Therefore,

$$\sum_{n \ge 1} P(|U_n - U| > \delta) < \infty.$$

An easy application of the Borel-Cantelli Lemma implies that $U_n$ converges a.s. to $U$. Since
$U_n \le U$, the Dominated Convergence Theorem implies that, for every $k \ge 1$, the $k$th moment of
$\sigma_n$ converges to the $k$th moment of $(1 + U)/2$. In particular, $\mathrm{Var}(\sigma_n)$ is a
bounded sequence. This is enough to prove that

$$\frac{1}{n} \sum_{i=1}^{n} (\sigma_i - E(\sigma_i)) \quad \text{converges a.s. to } 0;$$

see for instance Proposition 47.10 in Port (1994). Since $E(\sigma_n)$ is a convergent sequence we get
that $S_n/n$ converges a.s. to the limit of $E(\sigma_n)$.
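
The limiting quantity $(1 + U)/2$ can be sampled directly from the unreflected walk. The sketch
below is illustrative only: the function names are hypothetical and it assumes `lam > 1`, so that
the walk reaches 1 with finite mean time. By Wald's identity (a standard computation, not taken
from the text) $E(U) = (\lambda + 1)/(\lambda - 1)$, so the mean of $(1 + U)/2$ should be
$\lambda/(\lambda - 1)$.

```python
import random


def births_in_passage(lam, rng):
    """Number of right steps taken by the unreflected walk Z before it
    first hits 1, which equals (1 + U) / 2."""
    p_right = lam / (lam + 1.0)
    position = 0
    right_steps = 0
    while position < 1:
        if rng.random() < p_right:
            position += 1
            right_steps += 1
        else:
            position -= 1
    return right_steps


def mean_births(lam, runs, seed=0):
    """Monte Carlo estimate of E[(1 + U) / 2]."""
    rng = random.Random(seed)
    return sum(births_in_passage(lam, rng) for _ in range(runs)) / runs
```

For example, `mean_births(2.0, 20000)` should be close to $2$, up to sampling error.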

Since $N(t) \to \infty$ a.s., this strong law of large numbers gives that $S_{N(t)}/N(t)$ converges to
the limiting expectation of $\sigma_n$. This together with Lemma 2 shows that the two terms in (2)
converge to 0 when we let $s = \alpha t$ and $t$ goes to infinity. The proof of Theorem 1 in the
supercritical case is complete.

2.3 Case $\lambda = 1$

In this subsection we go back to the notation of Section 2.1, where $T_n$ is the time of the $n$th
visit of the chain to 1.

Lemma 4. Let $\lambda = 1$. Then,

$$\frac{T_n}{n \log n} \to 1 \quad \text{in probability.}$$

Proof of Lemma 4. When the chain hits 1, it waits a mean 1 exponential time and then jumps
to 2. Hence,

$$T_n = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} H_i,$$



where the $X_i$ are independent mean 1 exponential times and the $H_i$ are the hitting times of 1
starting at 2. The $H_i$ are i.i.d. with distribution function $F$. From the backward Kolmogorov
equation

$$\int_0^{F(t)} \frac{ds}{1 + s^2 - 2s} = t$$

we get

$$F(t) = \frac{t}{1 + t}.$$

We now use a weak law of large numbers, see Theorem 2 in VII.7 in Feller (1971). It is easier to
redo the short proof rather than check the hypotheses of the Theorem. The key is the following
consequence of Chebyshev's inequality applied to the truncated random variables:

$$P\left(\left|\frac{1}{n m_n} \sum_{i=1}^{n} H_i - 1\right| > \varepsilon\right) \le \frac{1}{n \varepsilon^2 m_n^2}\, s_n + n(1 - F(\rho_n)) \tag{9}$$

where

$$m_n = \int_0^{\rho_n} t F'(t)\,dt \quad \text{and} \quad s_n = \int_0^{\rho_n} t^2 F'(t)\,dt,$$

see (7.13) in VII.7 in Feller (1971). We will take $\rho_n = n\sqrt{\log n}$. A little calculus shows that

$$m_n \sim \log \rho_n \sim \log n \quad \text{and} \quad s_n \sim \rho_n.$$
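With our choice of $\rho_n$, $n(1 - F(\rho_n))$ converges to 0 and

$$\frac{1}{n m_n} \sum_{i=1}^{n} H_i$$

converges to 1 in probability. Since $m_n \sim \log n$ and, by the strong law of large numbers,
$\frac{1}{n \log n} \sum_{i=1}^{n} X_i \to 0$ a.s., this completes the proof of Lemma 4. □

Since the events $\{N(t) \ge n\}$ and $\{T_n \le t\}$ are the same, it follows that

$$N(t)\,\frac{\log t}{t} \to 1 \tag{10}$$

in probability as $t \uparrow \infty$.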

Now, $S_n/n^2$ converges in distribution to a one sided stable law of index $\frac{1}{2}$ (see
Theorem (7.7) in Durrett (2004)). By (10), it follows that $S_{N(t)}/N(t)^2$ also has this
distributional limit. In fact,

$$\frac{S_{N(\alpha t)}}{N(t)^2} \quad \text{converges to } Y_\alpha$$

in the sense of convergence of finite dimensional distributions, where $Y_\alpha$ is a stable
subordinator (increasing stable process) of index 1/2. (Note that independence between the
$\sigma$'s and $\tau$'s is not required here, which is good since they are highly dependent. All that
is needed is that the limit in (10) is constant and that both $S_n$ and $N(t)$ are monotone.) So, the
limit in (3) is

$$\lim_{t \to \infty} E\left(\frac{S_{N(\alpha t)}}{S_{N(t)}}\right) = E\left(\frac{Y_\alpha}{Y_1}\right) = \alpha.$$

To check the final equality, it is enough by monotonicity to verify it for rational $\alpha$. If
$\alpha = m/n$, this boils down to the simple fact that if $V_1, \dots, V_n$ are i.i.d. and positive, then

$$E\,\frac{V_1}{V_1 + \cdots + V_n} = \frac{1}{n}.$$
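
This symmetry fact does not need finite means, which matters here because the increments of an
index 1/2 subordinator have infinite expectation. The Monte Carlo sketch below illustrates it
under an assumption made for the example: the $V_i$ are drawn as $1/Z^2$ with $Z$ standard normal,
a standard way to generate a one sided stable variable of index 1/2. The function names are
hypothetical.

```python
import random


def levy_sample(rng):
    """One sided stable variable of index 1/2, generated as 1 / Z**2
    with Z standard normal."""
    z = rng.gauss(0.0, 1.0)
    return 1.0 / (z * z)


def mean_share(m, n, runs, seed=0):
    """Monte Carlo estimate of E[(V_1 + ... + V_m) / (V_1 + ... + V_n)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        v = [levy_sample(rng) for _ in range(n)]
        total += sum(v[:m]) / sum(v)
    return total / runs
```

With $\alpha = m/n$, `mean_share(m, n, runs)` should be close to $m/n$ for a large number of runs,
since each ratio lies in $[0, 1]$ and the $n$ shares are exchangeable and sum to 1.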